We present NusaCrowd, a collaborative initiative to collect and unite existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 117 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their effectiveness has been demonstrated in multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and its local languages. Furthermore, NusaCrowd also enables the creation of the first multilingual automatic speech recognition benchmark in Indonesian and its local languages. Our work is intended to help advance natural language processing research in under-represented languages.
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.
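For readers who want to try the released checkpoints, here is a minimal generation sketch using the Hugging Face transformers library. It assumes the small bigscience/bloom-560m variant, since the full 176B model requires multi-GPU sharding:

```python
# Minimal generation sketch using the publicly released BLOOM checkpoints.
# Uses the small bigscience/bloom-560m variant; the full 176B model needs
# hundreds of GB of memory and multi-GPU sharding.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

inputs = tokenizer("Translate to French: Hello, world!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```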
Task-adaptive pre-training (TAPT) alleviates the lack of labelled data and provides performance gains by adapting unlabelled data to the downstream task. Unfortunately, existing adaptation approaches mainly involve deterministic rules that do not generalize well. Here, we propose Clozer, a sequence-tagging based cloze answer extraction method used in TAPT that is extendable for adaptation to any cloze-style machine reading comprehension (MRC) downstream task. We experiment on multiple-choice cloze-style MRC tasks, show that Clozer performs significantly better than the oracle and the state-of-the-art TAPT in boosting model performance, and demonstrate that Clozer is able to recognize the gold answers independently of any heuristics.
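The sequence-tagging idea can be illustrated with a short, hypothetical sketch (the model, labels, and example below are illustrative, not the authors' released code): a token-classification head tags candidate gold-answer spans, which can then be masked to create cloze examples for TAPT:

```python
# Hypothetical sketch of sequence-tagging cloze answer extraction
# (names and setup are illustrative, not the authors' released code).
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# A token-classification model with BIO-style labels: O / B-ANS / I-ANS.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)

passage = "The Eiffel Tower was completed in 1889 in Paris."
enc = tokenizer(passage, return_tensors="pt")
with torch.no_grad():
    labels = model(**enc).logits.argmax(-1)[0]  # predicted tag per token

# Tokens tagged B-ANS/I-ANS form the extracted answer span; masking that
# span in the passage yields a cloze-style example for task-adaptive
# pre-training on the downstream MRC task.
answer_tokens = [t for t, l in zip(enc.tokens(), labels.tolist()) if l != 0]
print(answer_tokens)
```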
Natural Language Generation (NLG) has improved exponentially in recent years thanks to the development of sequence-to-sequence deep learning technologies such as Transformer-based language models. This advancement has led to more fluent and coherent NLG, leading to improved development in downstream tasks such as abstractive summarization, dialogue generation, and data-to-text generation. However, it is also apparent that deep learning-based generation is prone to hallucinate unintended text, which degrades system performance and fails to meet user expectations in many real-world scenarios. To address this issue, many studies have been presented on measuring and mitigating hallucinated text, but these have never before been reviewed in a comprehensive manner. In this survey, we thus provide a broad overview of the research progress and challenges in the hallucination problem in NLG. The survey is organized into two parts: (1) a general overview of metrics, mitigation methods, and future directions; and (2) an overview of task-specific research progress on hallucinations in the following downstream tasks, namely abstractive summarization, dialogue generation, generative question answering, data-to-text generation, machine translation, and visual-language generation. This survey serves to facilitate collaborative efforts among researchers in tackling the challenge of hallucinated texts in NLG.
Commonsense causality reasoning (CCR) aims to identify plausible causes and effects in natural language descriptions that an average person would deem reasonable. Despite great academic and practical interest, the problem has remained shadowed by the lack of a well-posed theoretical framework. Existing work typically relies wholeheartedly on deep language models and can be susceptible to confounding co-occurrences. Motivated by classical causal principles, we articulate the central question of CCR and draw parallels between human subjects in observational studies and natural language, adopting the potential-outcomes framework for CCR, the first such attempt for commonsense tasks. We propose ROCK, a novel framework to Reason O(A)bout Commonsense K(C)ausality, which uses temporal signals as incidental supervision and balances confounding effects using temporal propensities analogous to propensity scores. The ROCK implementation is modular and zero-shot, and demonstrates good CCR capability.
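A toy sketch of the balancing idea (purely illustrative, not the ROCK implementation): candidate effect scores are reweighted by an estimated temporal propensity, mirroring inverse-propensity weighting in observational studies:

```python
# Toy illustration of propensity-style balancing for commonsense causality
# (illustrative only; not the authors' ROCK implementation).
def balanced_causal_score(samples):
    """samples: list of (cause_present: bool, effect_score: float, propensity: float),
    where propensity estimates P(cause precedes effect) from temporal signals."""
    treated = [(s, p) for c, s, p in samples if c]
    control = [(s, p) for c, s, p in samples if not c]
    # Inverse-propensity weighting, as in observational studies.
    t = sum(s / p for s, p in treated) / max(len(treated), 1)
    c = sum(s / (1 - p) for s, p in control) / max(len(control), 1)
    return t - c  # positive => the cause plausibly raises the effect's likelihood

samples = [(True, 0.9, 0.8), (True, 0.7, 0.6), (False, 0.4, 0.3), (False, 0.2, 0.5)]
print(balanced_causal_score(samples))
```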
Recently, end-to-end (E2E) frameworks have achieved remarkable results on various automatic speech recognition (ASR) tasks. However, lattice-free maximum mutual information (LF-MMI), one of the discriminative training criteria that have shown superior performance in hybrid ASR systems, is rarely adopted in E2E ASR frameworks. In this work, we propose a novel approach to integrate the LF-MMI criterion into E2E ASR frameworks in both the training and decoding stages. The approach shows its effectiveness on two of the most widely used E2E frameworks, attention-based encoder-decoders (AEDs) and neural transducers (NTs). Experiments show that the introduction of the LF-MMI criterion consistently leads to significant performance improvements across various datasets and different E2E ASR frameworks. Our best models achieve competitive character error rates of 4.1% / 4.4% on the Aishell-1 dev/test sets; we also achieve significant error reductions over strong baselines on the Aishell-2 and LibriSpeech datasets.
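One schematic way such an integration can look (illustrative; the paper's exact formulation may differ) is to interpolate the standard E2E loss with an MMI-style discriminative term:

```python
# Schematic interpolation of an E2E loss with an MMI-style discriminative
# term (illustrative; not the paper's exact recipe).
import torch

def combined_loss(attention_loss, log_num, log_den, lam=0.5):
    """attention_loss: standard E2E (e.g., AED cross-entropy) loss.
    log_num: log-score of the reference path (numerator graph).
    log_den: log-sum over competing paths (denominator graph).
    lam: interpolation weight between the two criteria."""
    mmi_loss = -(log_num - log_den)  # LF-MMI favors numerator over denominator
    return (1 - lam) * attention_loss + lam * mmi_loss

# Toy values standing in for real graph scores:
print(combined_loss(torch.tensor(2.3), torch.tensor(-10.0), torch.tensor(-8.5)))
```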
Mixture-of-experts based acoustic models with dynamic routing mechanisms have demonstrated promising results for speech recognition. The design principle of the router architecture is important for achieving large model capacity and high computational efficiency. Our previous work, SpeechMoE, only uses a local grapheme embedding to help the router make routing decisions. To further improve speech recognition performance across different domains and accents, we propose a new router architecture that integrates additional global domain and accent embeddings into the router input to promote adaptability. Experimental results show that the proposed SpeechMoE2 achieves a lower character error rate (CER) than SpeechMoE with comparable parameters on both multi-domain and multi-accent tasks. Specifically, the proposed method provides up to a 12.8% relative CER improvement on the multi-domain task and relative CER improvements of 1.9%-17.7% on the multi-accent task. Moreover, increasing the number of experts also yields consistent performance improvements while keeping the computational cost constant.
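The router idea can be sketched as follows (illustrative, not the SpeechMoE2 source code): utterance-level domain and accent embeddings are broadcast over time and concatenated with the frame-level local embedding before computing expert weights:

```python
# Illustrative sketch of a router that mixes a local frame-level embedding
# with global domain/accent embeddings (not the SpeechMoE2 source code).
import torch
import torch.nn as nn

class Router(nn.Module):
    def __init__(self, local_dim, domain_dim, accent_dim, n_experts):
        super().__init__()
        self.proj = nn.Linear(local_dim + domain_dim + accent_dim, n_experts)

    def forward(self, local_emb, domain_emb, accent_emb):
        # Global utterance-level embeddings are broadcast over time and
        # concatenated with the frame-level local embedding.
        T = local_emb.size(1)
        glob = torch.cat([domain_emb, accent_emb], dim=-1)
        glob = glob.unsqueeze(1).expand(-1, T, -1)
        logits = self.proj(torch.cat([local_emb, glob], dim=-1))
        return logits.softmax(dim=-1)  # per-frame expert weights

router = Router(local_dim=256, domain_dim=32, accent_dim=32, n_experts=4)
w = router(torch.randn(2, 100, 256), torch.randn(2, 32), torch.randn(2, 32))
print(w.shape)  # (2, 100, 4)
```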
The task of few-shot style transfer for voice cloning in text-to-speech (TTS) synthesis aims to transfer the speaking style of an arbitrary source speaker to a target speaker's voice using a very limited amount of neutral data. This is a very challenging task, since the learning algorithm needs to deal with few-shot voice cloning and speaker-prosody disentanglement at the same time. Accelerating the adaptation process for a new target speaker is of great importance in real-world applications, but is even more challenging. In this paper, we approach this hard, fast few-shot style transfer for voice cloning task using meta-learning. We investigate the model-agnostic meta-learning (MAML) algorithm and meta-transfer a pre-trained multi-speaker, multi-prosody base TTS model so that it becomes highly sensitive to adaptation with few samples. A domain adversarial training mechanism and an orthogonal constraint are adopted to disentangle speaker and prosody representations for effective cross-speaker style transfer. Experimental results show that the proposed approach can perform fast voice cloning using only 5 samples (around 12 seconds of speech data) from a target speaker, with only 100 adaptation steps. Audio samples are available online.
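A minimal, generic MAML inner/outer-loop sketch (a stand-in linear model replaces the TTS network; this is not the paper's training code):

```python
# Minimal MAML-style inner/outer loop sketch (generic; a linear layer
# stands in for the multi-speaker TTS base model).
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # stand-in for the base TTS model
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
inner_lr = 0.01

def task_loss(params, batch):
    x, y = batch
    out = torch.nn.functional.linear(x, params[0], params[1])
    return torch.nn.functional.mse_loss(out, y)

for step in range(100):
    support = (torch.randn(5, 8), torch.randn(5, 8))  # few-shot samples
    query = (torch.randn(5, 8), torch.randn(5, 8))
    params = list(model.parameters())
    # Inner loop: a few gradient steps on the support set (here, one).
    grads = torch.autograd.grad(task_loss(params, support), params,
                                create_graph=True)
    adapted = [p - inner_lr * g for p, g in zip(params, grads)]
    # Outer loop: update the initial weights so that adaptation generalizes.
    meta_opt.zero_grad()
    task_loss(adapted, query).backward()
    meta_opt.step()
```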
Exploiting rich linguistic information in raw text is crucial for expressive text-to-speech (TTS). As large-scale pre-trained text representations have developed, bidirectional encoder representations from Transformers (BERT) have been shown to embody semantic information and have recently been applied to TTS. However, original or simply fine-tuned BERT embeddings still cannot provide the sufficient semantic knowledge that expressive TTS models should take into account. In this paper, we propose a word-level semantic representation enhancing method based on dependency structure and pre-trained BERT embeddings. The BERT embedding of each word is reprocessed considering its specific dependencies and related words in the sentence, to generate a more effective semantic representation for TTS. To better utilize the dependency structure, a relational gated graph network (RGGN) is introduced to let semantic information flow and aggregate through the dependency structure. Experimental results show that the proposed method further improves the naturalness and expressiveness of synthesized speech on both Mandarin and English datasets.
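A simplified sketch of gated message passing over dependency arcs (illustrative; the paper's RGGN additionally conditions on relation types):

```python
# Simplified sketch of gated message passing over a dependency graph
# (illustrative; the paper's RGGN also distinguishes relation types).
import torch
import torch.nn as nn

class GatedGraphLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, h, edges):
        """h: (num_words, dim) BERT word embeddings.
        edges: list of (head, dependent) dependency arcs."""
        agg = torch.zeros_like(h)
        for head, dep in edges:  # aggregate messages along arcs
            agg[dep] += self.msg(h[head])
            agg[head] += self.msg(h[dep])
        return self.gru(agg, h)  # gated update of word states

layer = GatedGraphLayer(dim=768)
h = torch.randn(6, 768)  # 6 words in a sentence
updated = layer(h, edges=[(1, 0), (1, 3), (3, 2), (3, 5), (5, 4)])
print(updated.shape)
```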
Reading comprehension of legal text can be a particularly challenging task due to the length and complexity of legal clauses and a shortage of expert-annotated datasets. To address this challenge, we introduce the Merger Agreement Understanding Dataset (MAUD), an expert-annotated reading comprehension dataset based on the American Bar Association's 2021 Public Target Deal Points Study, with over 39,000 examples and over 47,000 total annotations. Our fine-tuned Transformer baselines show promising results, with models performing well above random on most questions. However, on a large subset of questions, there is still room for significant improvement. As the only expert-annotated merger agreement dataset, MAUD is valuable as a benchmark for both the legal profession and the NLP community.
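A hedged sketch of the kind of fine-tuned Transformer baseline the abstract describes, assuming a multiple-choice format (the clause, question, and choices below are invented for illustration, not drawn from MAUD):

```python
# Hedged sketch of a multiple-choice reading-comprehension baseline
# (the data fields below are hypothetical, not actual MAUD examples).
import torch
from transformers import AutoTokenizer, AutoModelForMultipleChoice

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased")

clause = "The Company shall not solicit alternative acquisition proposals..."
question = "Does the agreement include a no-shop provision?"
choices = ["Yes, a standard no-shop.", "No, it allows a go-shop period."]

# Each (clause + question, choice) pair is encoded; the model scores choices.
enc = tokenizer([[clause + " " + question, c] for c in choices],
                return_tensors="pt", padding=True, truncation=True)
enc = {k: v.unsqueeze(0) for k, v in enc.items()}  # (batch=1, n_choices, seq)
with torch.no_grad():
    pred = model(**enc).logits.argmax(-1)
print(choices[pred.item()])
```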